While pointwise operations process each element of a tensor independently, reduction patterns introduce data dependencies, combining multiple input elements into a single output value (e.g., sum, max, or mean). Implementing these operations efficiently requires bridging the gap between the logical two-dimensional data structure and its linear representation in hardware memory.
1. 2D Memory Mapping
A 2D tensor is logically a grid, but it is laid out linearly in physical memory. Understanding row-major versus column-major layout is essential for determining whether a reduction traverses contiguous memory addresses or requires strided access.
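The difference between the two layouts can be made concrete with NumPy's `strides`, which report how many bytes separate consecutive elements along each axis (a minimal sketch; the array shape and dtype are illustrative choices):

```python
import numpy as np

# Row-major (C order, NumPy's default): elements of a row are adjacent in memory.
a = np.arange(6, dtype=np.float32).reshape(2, 3)
# Column-major (Fortran order): elements of a column are adjacent instead.
b = np.asfortranarray(a)

# strides = bytes to step to reach the next element along each axis
print(a.strides)  # (12, 4): stepping along a row moves 4 bytes (contiguous)
print(b.strides)  # (4, 8): stepping along a row moves 8 bytes (strided)
```

In the row-major array, walking across a row touches consecutive addresses; in the column-major copy, the same walk jumps by a full column's worth of bytes, which is exactly the access pattern a row-wise reduction wants to avoid.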
2. Pointwise vs. Reduction Topology
A matrix copy represents a pointwise operation with a one-to-one ($1:1$) input-to-output mapping. In contrast, a reduction is a many-to-one ($N:1$) operation that requires shared accumulation across threads or sequential processing within a block.
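The mapping ratio is visible directly in output shapes (a minimal NumPy sketch; the input size is arbitrary):

```python
import numpy as np

x = np.arange(8, dtype=np.float32)

# Pointwise (1:1): each output element depends on exactly one input element.
y = x.copy()   # 8 inputs -> 8 outputs, no cross-element dependency
# Reduction (N:1): all inputs are combined into a single statistic.
s = x.sum()    # 8 inputs -> 1 output, every input feeds the result

print(y.shape, s.shape)  # (8,) ()
```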
3. Dimension Collapse
A reduction is defined by its axis. Reducing along axis 1 (row-wise) versus axis 0 (column-wise) fundamentally changes the memory stride pattern and the hardware cache hit rate.
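The shape arithmetic of axis collapse can be sketched in NumPy (the `(4, 5)` shape is an illustrative stand-in for `(M, N)`):

```python
import numpy as np

A = np.ones((4, 5), dtype=np.float32)  # shape (M, N) = (4, 5)

# Axis 1 collapses the columns: one value per row.
row_sums = A.sum(axis=1)                    # shape (4,)
# Axis 0 collapses the rows: one value per column.
col_sums = A.sum(axis=0)                    # shape (5,)
# keepdims retains the collapsed axis with size 1, e.g. (M, 1).
row_sums_kd = A.sum(axis=1, keepdims=True)  # shape (4, 1)
```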
QUESTION 1
[Short Answer] How does a matrix copy differ from a reduction?
A matrix copy is a 1:1 pointwise operation; a reduction is a many-to-one operation requiring data synchronization.
✅ Correct! Pointwise operations (like copy) map one input to one output, whereas reductions collapse multiple inputs into a single statistic.
❌ Incorrect. Think about the mapping ratio. A copy is 1:1, but a reduction (like sum) is N:1.
QUESTION 2
Which memory layout is characterized by elements of the same row being stored in adjacent memory addresses?
Column-major
Row-major
Strided-major
Z-order curve
✅ Correct! Row-major (C-style) layout stores $A[i][j]$ next to $A[i][j+1]$.
❌ Incorrect. In column-major (Fortran-style), elements of the same column are contiguous.
QUESTION 3
If we reduce a tensor of shape (M, N) across axis 1, what is the resulting shape?
(M, 1) or (M,)
(1, N) or (N,)
(1, 1)
(M, N)
✅ Correct! Reducing across axis 1 collapses the columns, leaving one value per row (size M).
❌ Incorrect. Axis 1 represents the column dimension in a 2D tensor.
QUESTION 4
Why is 'Bias Addition' considered a pointwise operation compared to 'Softmax'?
Bias addition requires every element in a row to be summed first.
Each output element in a bias add depends only on its corresponding input element and a constant.
Bias addition is performed in global memory only.
Softmax does not involve any exponentiation.
✅ Correct! Because each addition is independent of other elements in the tensor.
❌ Incorrect. Pointwise operations lack the cross-element data dependencies found in reductions.
QUESTION 5
What is the primary architectural challenge when implementing a reduction in Triton?
Writing the result back to global memory.
Communicating or 'voting' across threads to find a single value (e.g., max).
Using the address-of operator.
Handling floating point addition.
✅ Correct! Reductions require data dependencies where threads must synchronize or share results to compute the final aggregate.
❌ Incorrect. The challenge lies in the N-to-1 dependency, not simple I/O.
Case Study: Architectural Analysis of Row-Wise Sum
Analyzing Memory vs. Compute Topology
You are tasked with optimizing a row-wise sum for a 1024x1024 matrix stored in row-major format. The kernel reads an entire row into SRAM before performing the reduction.
Q
How does the memory access pattern differ between a matrix copy and this row-wise sum?
Solution:
In a matrix copy, both the read and write operations are contiguous and $1:1$, allowing for high-throughput coalesced memory access. In a row-wise sum, the read is contiguous (loading the row), but the write is $N:1$, where 1024 elements produce only 1 output scalar, significantly changing the bandwidth-to-compute ratio.
Q
Why is understanding row-major layout critical for this specific reduction?
Solution:
Because the reduction is row-wise, row-major layout ensures that all 1024 elements of a row are contiguous in physical RAM. If the matrix were column-major, summing a row would require strided access (jumping across memory addresses), which would significantly degrade performance due to poor cache utilization.
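The access pattern described above can be sketched in NumPy, standing in for the Triton kernel: each loop iteration plays the role of one program instance that loads a contiguous row into fast memory, reduces it, and writes back a single scalar (the $N:1$ write). The matrix contents are illustrative.

```python
import numpy as np

M, N = 1024, 1024
A = np.ones((M, N), dtype=np.float32)  # row-major by default, rows contiguous

out = np.empty(M, dtype=np.float32)
for row in range(M):        # one iteration ~ one kernel program instance
    block = A[row, :]       # contiguous read: the whole row sits in adjacent addresses
    out[row] = block.sum()  # on-chip reduction: 1024 inputs -> 1 output scalar
```

Reads stream through 1024 contiguous floats per row, while each row's write is a single scalar, which is precisely the bandwidth-to-compute shift the case study highlights.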